Multi-class glioma segmentation on real-world data with missing MRI sequences: comparison of three deep learning algorithms

This study tests the generalisability of three Brain Tumor Segmentation (BraTS) challenge models using a multi-center dataset of varying image quality and incomplete MRI datasets. In this retrospective study, DeepMedic, no-new-Unet (nn-Unet), and NVIDIA-net (nv-Net) were trained and tested using manual segmentations from preoperative MRI of glioblastoma (GBM) and low-grade gliomas (LGG) from the BraTS 2021 dataset (1251 in total), in addition to 275 GBM and 205 LGG acquired clinically across 12 hospitals worldwide. Data was split into 80% training, 5% validation, and 15% internal test data. An additional external test-set of 158 GBM and 69 LGG was used to assess generalisability to other hospitals’ data. All models’ median Dice similarity coefficient (DSC) for both test sets were within, or higher than, previously reported human inter-rater agreement (range of 0.74–0.85). For both test sets, nn-Unet achieved the highest DSC (internal = 0.86, external = 0.93) and the lowest Hausdorff distances (10.07, 13.87 mm, respectively) for all tumor classes (p < 0.001). By applying Sparsified training, missing MRI sequences did not statistically affect the performance. nn-Unet achieves accurate segmentations in clinical settings even in the presence of incomplete MRI datasets. This facilitates future clinical adoption of automated glioma segmentation, which could help inform treatment planning and glioma monitoring.

datasets.This facilitates future clinical adoption of automated glioma segmentation, which could help inform treatment planning and glioma monitoring.Clinically accurate segmentation and longitudinal volumetric analysis of glioma are helpful in treatment planning and response monitoring 1,2 .Volumetric analyses are not commonly used in clinical practice and are generally limited to crude 2D measurements in clinical trials.While this is the current standard for treatment response evaluation in trials 3 , poor prognosis and heterogeneous treatment response encourage quantitative analysis of tumors, especially for glioma due to their varied morphometry and infiltrative nature [4][5][6][7] .It is these two characteristics of glioma, along with heterogenous contrast enhancement, that complicate their manual delineation and further highlight the need for automated segmentation protocols in the clinical setting [8][9][10] .Indeed, baseline imaging and volumetric measurements are of particular importance to neurosurgeons and radiotherapists because tumor volume and functional anatomy are key factors for both risk and prognostic assessment of patients 11,12 .

Abbreviations
The VASARI features have illustrated the importance of extracting such quantitative measures, but automation of segmentation and subsequent feature extraction is needed to enable widespread application 13,14 .Automated quantification could provide improvements in reporting time, treatment response monitoring, and overall efficiency across a neuroradiological service, but is dependent upon technical and clinical validation of the methods [15][16][17] .Deep learning has emerged as the preferred method for automated tumor segmentation 6,[18][19][20][21] .Ideally, the clinical environment requires a fast algorithm that is robust to scanner variation and missing MRI sequences.
Since 2012, the annual Brain Tumor Segmentation (BraTS) Challenge has compared the performance of numerous AI-driven glioma segmentation algorithms 18,22 .However, these algorithms are trained and assessed on a highly curated dataset optimised for quality: each subject has a complete dataset of high-quality pre-and post-contrast T1-weighted (T1w and T1c, respectively), T2-weighted (T2w), and T2-weighted fluid-attenuated inversion recovery (FLAIR) images, which does not accurately reflect the realities of clinically-acquired MRI data.For example, a recent study using a model (DeepMedic) trained exclusively on BraTS data, achieved a median Dice similarity coefficient (DSC) of 0.81 on BraTS test data but only 0.49 on external clinical data 23 .
The aim of the current study was to determine the performance and generalisability of three of the highestperforming models at recent BraTS challenges [24][25][26] on real-world clinical data.Models have been trained with both BraTS data and another multi-centre dataset obtained from 12 different hospitals worldwide: the PICTURE project (www.pictu repro ject.nl) [27][28][29][30][31] .An external test set comprised of PICTURE data from hospitals not used in the training and validation phases was employed to assess the clinical applicability and determine the need for retraining models on a hospital's own data.Furthermore, we use sparsified training, to account for missing sequences 23 , and assessed performance in patients with incomplete MRI datasets.

Materials and methods
All patients provided informed consent and data were obtained and anonymized according to the General Data Protection Regulation and Health Insurance Portability and Accountability Act.Local Institutional Review Board approval was obtained for all primary studies.For the the VU medical center Amsterdam the institutional review board approved of the experiments in this study under case nr.2014.336.Of the patients involved in the current study, 40 were previously studied in an inter-rater agreement study by Visser et al. 29 .The 275 Glioblastoma patients from the PICTURE dataset were previously used in a study focused on robust tumor core segmentation in glioblastoma patients Eijgelaar et al. 23 .The study was carried out in concordance with the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) guidelines 32 .

Missing scans
Both datasets contain pre-operative T1w, T1c, T2w, and FLAIR images.However, in the PICTURE dataset some patients had missing sequences, see Table 1 for a breakdown and Fig. 1 for examples of subjects with a missing FLAIR or T2.Only patients with at least T1c and either T2w or FLAIR were included to be able to manually segment all tissue classes.Out of 1731 total cases, there were 204 missing pre-contrast T1w, 186 missing T2w, and 19 missing FLAIR, see "Model training and testing" for details of the sparsified training used to account for missing sequences.

Pre-processing
T1w, T2w, and FLAIR images were rigidly registered to the T1c image.Subsequently, the T1c was registered to the SRI24 atlas (https:// www.nitrc.org/ proje cts/ sri24/) 33 using an affine transformation.The same transform was applied to the other MR sequences (T2w, FLAIR, T1w).All modalities were resampled to 1mm isotropic voxels in the SRI24 atlas space, the rigid and affine registrations were applied using a single interpolation step.All registrations and resampling were conducted using the Advanced Normalization Tools (ANTs) 34 .N4 bias field correction 35 was used and skull stripping was performed with the "HD-bet" algorithm (https:// github.com/ MIC-DKFZ/ HD-BET) 36 .

Manual segmentations
For the PICTURE data, 275 GBM and 205 LGG cases were manually segmented into 3 classes consistent with the BraTS challenges -whole tumor (WT), tumor core (TC), and enhancing tumor (ET), see Fig. 1.The WT defines the full extent of the tumor, including the tumor core and oedema, indicated by hyperintensity on FLAIR and T2w.The TC is the main body of the tumor and most likely area of resection.The TC includes the enhancing tumor (ET) and necrosis.www.nature.com/scientificreports/Manual segmentations were carried out according to the VASARI Research Project (https:// wiki.cance rimag ingar chive.net/ displ ay/ Public/ VASARI+ Resea rch+ Proje ct).One rater (HP) with 9 years of brain MRI manual segmentation experience performed segmentations under the supervision and approval of an expert neuroradiologist (FB), using the semiautomatic SmartBrush tool (BrainLab, Feldkirchen, Germany).The rater's performance was in line with experts 29 .All segmentations were exported on the T1c image.The segmentation was resampled to SRI24 atlas space using the same transform from the T1c to SRI24 registration.

Quality control
Visual quality control checks were carried out for incomplete coverage, skull stripping, registration errors, and incomplete segmentations.Overview images were generated to facilitate quality control.The images show the same axial, sagittal, and coronal view for all patients to assess the registration quality, as well as an axial view of the center of the tumor to verify the segmentation.Seven scans were not included due to poor image quality and five due to severe registration errors (as illustrated in Appendix 2).

Deep learning segmentation models
Three algorithms were selected for this study based on high performance in recent BraTS challenges 18,22,37 , availability of a user-friendly and reproducible implementation online, and the uniqueness of the algorithm, see Table 2.   23 .This study showed that performance drops substantially if not all sequences are available.This could be solved by inserting empty (zero-filled) scans in place of missing sequences, see the first column of Fig. 1.During training, the T1w, T2w, and FLAIR were additionally set to zero with independent probabilities of 20%, in line with the estimated frequency of missing sequences in the clinical setting 23 .We used the validation data to confirm convergence, the hyperparameters of all models were kept at the default values, as reported in the associated papers, or as used in the published code repositories (Table 2).All model training and testing was carried out using a machine equipped with an AMD Ryzen 9 3900X 12 core processor, 64GB RAM, and 1 NVIDIA RTX3090 (24GB) graphics processing unit (GPU).

Model performance assessment
In line with the BraTS challenges, tumor segmentations for each algorithm were assessed using median and inter-quartile range Dice similarity coefficient (DSC) 38 and Hausdorff distance (HD) 39 for all experiments.Results were generated using methods described by Taha and Hanbury 40 and associated software.

Experiments and statistical analyses
Four separate experiments were performed.In experiment 1, DSC and HD from the internal and external test sets were analyzed separately for each model/tumor class using a paired two-tailed t-test to assess differences between each model.DSC and HD were also compared in the following experiments using independent samples (Welch's) t-tests: experiment 2-GBMs vs LGGs on the internal and external test set to assess differences in performance on the differing tumor grades; experiment 3-GBMs and LGGs from the internal vs external test sets to assess the change in performance when segmenting external hospital data not previously seen by the models; and experiment 4-GBM and LGG patients with incomplete vs complete imaging datasets in the external test set to assess the change in performance when segmenting patients with incomplete imaging datasets from external hospital data not yet seen by the models.A single Bonferroni correction was applied for each experiment 41 .Outliers in box plots and overall outlier rates for each model and segmentation class were calculated using the IQR × 1.5 rule, i.e. outside [Q1 − 1.5 × IQR; Q3 + 1.5 × IQR] 42,43 .

PICTURE dataset segmentations
See Fig. 2 for ground truth manual segmentation and automated segmentation examples from all three models.See Appendix 3 and 4 for GBM and LGG segmentation contours on a 4 T T1c scan along with two human experts' manual segmentation, for all tumor classes.

Experiment 1-Segmentation performance on both test sets-which model achieved the best metrics?
Box plots showing median DSC and HD for all models and tumor classes on the internal and external test sets are presented in Fig. 3. nn-Unet achieved significantly higher DSCs than nvNet and lower HDs than both nvNet and DeepMedic for all tumor classes on both the internal and external test sets (p values < 0.0027).The raw metrics are displayed in Table 3.

Experiment 2-Segmentation performance on GBM vs LGG
Comparing performance between GBM and LGG, nn-Unet continued to provide the best quality results for both tumor grades, statistical comparisons are reported in Table 4.However, overall segmentation performance on the LGG was notably weaker than for GBM across all models, see Fig. 4 for box plots.DeepMedic showed the largest decrease in performance across all tumor classes.

Experiment 3-GBM segmentation performance on an external test set-do models need to be retrained for new hospital data?
As shown in experiment 1, nn-Unet produced the most favourable results when compared to the other models on both test sets.Table 3 shows the DSC and HD results for the internal test set (15 GBM, 23 LGG) and the external test set (158 GBM, 69 LGG) comprised of cases from hospitals not included in the training data, see Fig. 5 for box plots.nn-Unet showed the smallest absolute decrease and increase in respectively DSC and HD from the internal to the external test set for GBM WT, (DSC internal: 0.97, external: 0.95, p < 0.001*, HD internal: 7.34, external: 9.11, p = 0.958).All models' DSC were slightly reduced on WT and TC for both HGG and LGG but remained within clinically-acceptable range 18,22,44 .However, the segmentation performance of ET improved in the external dataset for all models.5 separately.For GBM, nn-Unet achieved the highest DSCs and lowest HD for all tumor classes on both incomplete and complete scans, with the exception of nvNet reaching a slightly lower HD on TC for incomplete scans.There were no statistically significant differences between the two groups for all models.

Outlier rates
For outliers according to DSC, based the IQR × 1.5 rule 42,43

Discussion
In this study, we compared the performance of three of the top performing BraTS challenge deep learning models for automated brain tumor segmentation in an external multi-centre hospital dataset (https:// www.pictu repro ject.nl).We extended the valuable work of the BraTS challenge by increasing the number of training cases and using a less strictly curated, and therefore more clinically-relevant, dataset [27][28][29][30][31] .Subsequently, we tested the generalisability of the three models on an external test set comprised of data from hospitals not used in model training.Akin to the realities of clinical assessment, we further show the utility of these models when segmenting incomplete MRI datasets, due to acquisition protocols or patient-specific circumstances, sparsified training was applied to account for missing pulse-sequences 23 .Our results demonstrate that nn-Unet, when supplemented sparsified training, produces high DSC and low HD for glioma segmentations in real-world hospital data.

Clinical implications
Manual segmentations are the current gold standard in clinical practice, where an inter-rater variability of 0.74-0.85DSC has been previously reported in the BraTS challenge 18,22,44 .All models' median DSCs for both test sets were within this "clinically acceptable" inter-rater agreement range.However, manual segmentations are not a time-efficient process.Semi-automatic multi-class glioma segmentation using BrainVoyagerTM QX, ITK-Snap and 3D Slicer is reported to take an average of 18-41 min per patient 45 .On the whole, automated inference times in the current study were considerably lower than these reported semi-automated segmentation times, see Appendix 5 for all results.nn-Unet takes approximately 37 min of computer time to produce a segmentation using a CPU or only 4.5 min when a GPU is available, versus 18-41 min of human rater time.
The majority of median DSCs were within this clinically-acceptable range of 0.74-0.85 18,22,44when testing on an external test set with missing pulse-sequences, but there was a decrease in DSC for all models on the WT and TC, but not for the ET.The TC yielded the most accurate segmentations for both DSC and HD across models.Since the TC is the main body of the tumor and the most likely area of resection, our findings suggest that using www.nature.com/scientificreports/nn-Unet with sparsified training may be an optimal combination for pre-surgical applications, with acceptable results in 97.47% of patients, based on the outlier rate of 2.53% for nn-Unet.nn-Unet yielded the fewest outliers in all categories across all models.Furthermore, it showed the smallest reductions in segmentation performance on the external test set.There were also no statistically significant changes in segmentation quality when comparing complete versus incomplete imaging datasets.In line with other recent work, this suggests not all of the MRI sequences are necessary when models are augmented using sparsified training, or similar methods 23,46,47 .However, the lower WT DSCs indicate a heavier reliance on a full set of MRI sequences for WT segmentation, which is plausible given the hyperintensity of oedema on FLAIR and T2w.Previous studies have also used generative adversarial networks (GAN) to synthesise missing sequences with very promising results 48 , therefore direct comparison of this approach and the sparsified trained used in the current study is encouraged.www.nature.com/scientificreports/discrepancy in these findings demonstrates the value and relevance of testing models on unseen hospital-quality data with missing sequences, as we have in the current study.

Limitations
We performed segmentations in line with the same definitions of the BraTS challenge in order to facilitate comparison, see Fig. 1.However, these definitions may not be those used in the clinical setting.In the BraTS challenge, the WT includes oedema and associated infiltrations but in reality neurosurgeons and neuroradiologists would more often classify the edge of the "tumor core" as the clinical definition of the "whole tumor", i.e. the enhancing and non-enhancing part of the core and its associated necrosis, not including oedema.While this definition might be a better representation of the truth, current MRI techniques make it very difficult to distinguish between oedema and non-enhancing infiltrative tumor.Further research is needed to accurately distinguish between non-enhancing tumor and oedema.Depending on the intended use case for automated glioma segmentations, having a less subjective, more consistent measurement may generate a more accurate representation of true tumor infiltration, and the associated increased (inter-rater) variability.The WHO glioma classification have been updated in 2021: WHO CNS5 has some variations by further advancing the role of molecular diagnostics in the classification of CNS tumors, but remains rooted in its established methods of histology and immunohistochemistry in tumor characterisation 49 .The classification of GBM and LGG is very relevant to a model trained on combined LGG and GBM data, especially when it works on all gliomas.Furthermore, we did not target hyperparameter optimisation for the sparsified training, nor did we make specific architecture optimisations for training and testing these models on a much larger dataset.Peak performance may be improved by doing so, but we chose not to tweak hyperparameters in order to promote generalisability.

Future work
In our study, we have only used pre-operative scans, while post-operative and longitudinal scans are also clinically relevant for radiotherapy planning, quantitative follow-up, and automatic growth detection; however, pre-operative baseline measurements are required for these assessments.Future work should follow the BraTS challenge latest aims and include disease progression monitoring and overall survival prediction.Furthermore, the current approach relied on having at least the T1c scan available.While this is a safe assumption for most retrospective cohorts, this may be different for future cohorts due to ongoing efforts to reduce gadolinium use 50 .To support these sequences new models would have to be trained, however we have shown that sparsified training provides a simple solution to train models that are flexible to the available sequences.
The tested networks all use very different implementations, making it difficult to pinpoint which differences between the models best explain the observed performance differences.To gain a better understanding of which properties most affect performance, future development should focus on consolidating different models within a single framework and applying and testing changes gradually.
We have shown that sparsified training offers a simple solution to missing sequences that is easy to implement for different network architectures and frameworks.While dealing with missing sequences is important, and allows for the inclusion of larger (retrospective) cohorts, improving the availability of all sequences for future patients would tackle the problem at the root.

Figure 1 .
Figure 1.Sample images of PICTURE dataset.Ground truth manual segmentation for a GBM patient with missing FLAIR scan (top row) and one with missing T2w (bottom row), see "Model training and testing" for details of sparsified training which is used to account for missing sequences.Whole Tumor (WT) in green.The WT defines the full extent of the tumor, including the tumor core and oedema, indicated by hyperintensity on FLAIR and T2w.Tumor Core (TC) in red.The TC is the main body of the tumor and most likely area of resection.The TC includes the enhancing tumor (ET) and necrosis.The ET is shown in yellow.

Figure 2 .
Figure 2. GBM patient from the PICTURE dataset with missing FLAIR scan.Whole Tumor (WT-green) is the full extent of the tumor, including the tumor core, non-enhancing tumor and oedema, indicated by hyperintensity on FLAIR and T2w.Tumor Core (TC-red) is the main body of the tumor and most likely area of resection.The TC includes the enhancing tumor (ET-yellow) and necrosis.DSCs in this case for nn-Unet were WT = 0.93, TC = 0.94, ET = 0.83; nvNet WT = 0.89, TC = 0.92, ET = 0.80; and DeepMedic WT = 0.81, TC = 0.85, ET = 0.81.

Figure 3 .
Figure 3. Box plots showing DSC and HD in internal-and external-test sets for all models and tumor classes.Left plots show the test set performance (n = 226) and the right plots show the performance in the external test set (n = 277).

Figure 4 .
Figure 4. Box plots showing DSC and HD in LGG and HGG patients for all models and tumor classes.upper row shows DSC and the bottom row shows HD for all models and tumor classes on GBM (n = 15 test + 158 external test, on the right) vs LGG (n = 23 internal test + 69 external test, on the left) patients.

Figure 5 .
Figure 5. Box plots showing DSC and HD for all models and tumor classes on the internal test set.(a) shows GBMs (n = 15 internal test set cases plus 23 BraTS test set GBM cases, upper left plots) and external test set GBMs (n = 158, upper right plots).(b) shows LGG (n = 23 internal test, lower left plots + 69 external test, lower right plots).
We used manual segmentations of preoperative imaging of 1251 gliomas (unspecified mix of GBM and LGG) from the BraTS 2021 dataset and 275 GBM and 205 LGG (median age, 63.7 IQR [54.3-72.0]years; median survival, 323 [142-609] days; surgery extent: 348 resections, 83 biopsies, 49 unknown) from the PICTURE project.The PICTURE dataset was collected across 12 hospitals worldwide, all patients of at least 18 years old with a newly-diagnosed LGG, or GBM at first-time surgery between 1/1/2012 and 12/31/2013 were included.Since the PICTURE data was collected in 2012 and 2013, the classification of GBM and LGG was in line with WHO 2007 criteria.Demographics for the PICTURE dataset are documented in Appendix 1 of the supplementary material.

Table 1 .
Breakdown of data used in this study from the BraTS and PICTURE datasets (https:// www.pictu repro ject.nl), and missing data totals from each hospital, as well as a breakdown of the train, validation, test, and external test sets.

Table 2 .
Summary of the deep learning algorithms tested in this study.Models were trained with three-class segmentations (WT, TC, ET) for each tumor.The scans were randomly split in 80% training, 5% validation, and 15% internal test data, see Table1.Test data was used to assess the performance of each model.Alongside the 15% internal test data, models were further assessed using an external test set of 158 GBM and 69 LGG patients from PICTURE hospitals not included in the training data, herein referred to as the external test set.This helped to gauge the generalisability and determine the future need for retraining algorithms on a new hospital's unseen data.In order to address missing sequences in the training data (Table1), sparsified training was applied for all algorithms 25,52ast to U-net, DeepMedic does not have an up-sampling 'side' .It predicts 1 × 1x1 voxels based on a high-and low-res input of 17 × 17x17 voxel, the low-res input uses a down sampled version of the image25,52(https:// github.com/ deepm edic/ deepm edic) nn-Unet ('no-new-Net') A U-net network architecture using a 2D, 3D and a cascaded U-net.Three U-net structures are trained simultaneously, and the best trained model is automatically selected Vol.:(0123456789) Scientific Reports | (2023) 13:18911 | https://doi.org/10.1038/s41598-023-44794-0www.nature.com/scientificreports/Model training and testing

Table 3 .
Median DSC and HD for all models and tumor classes on the internal test set GBMs (n = 15) and external GBMs test set (n = 158).Bold font indicates most favourable score in each scenario.Bonferroni adjusted p values at < 0.0027 comparing the performance of models on test set vs external test set were considered significant and are denoted by asterisk, *

Table 4 .
DSC and HD for all models and tumor classes on all GBM's (n = 173) and all LGGs (n = 92) in the internal test set plus external test set combined.Bold font indicates best score in each scenario.Bonferroni adjusted P values < 0.0027 comparing the performance of models were considered significant and are denoted by asterisk, *

Table 5 .
Median DSC and HD for all models and tumor classes on patients in the external test set with missing sequences (n = 44GBM + 55LGG) and subjects with complete scans (n = 114GBM + 14LGG).Bold font indicates best score in each scenario.